{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "136a4efe-fb99-4311-8679-e0a5b6282755",
   "metadata": {},
   "source": [
    "<table style=\"width:100%\">\n",
    "<tr>\n",
    "<td style=\"vertical-align:middle; text-align:left;\">\n",
    "<font size=\"2\">\n",
    "Supplementary code for the <a href=\"http://mng.bz/orYv\">Build a Large Language Model From Scratch</a> book by <a href=\"https://sebastianraschka.com\">Sebastian Raschka</a><br>\n",
    "<br>Code repository: <a href=\"https://github.com/rasbt/LLMs-from-scratch\">https://github.com/rasbt/LLMs-from-scratch</a>\n",
    "<br>汉化的库: <a href=\"https://github.com/GoatCsu/CN-LLMs-from-scratch.git\">https://github.com/GoatCsu/CN-LLMs-from-scratch.git</a>\n",
    "</font>\n",
    "</td>\n",
    "<td style=\"vertical-align:middle; text-align:left;\">\n",
    "<a href=\"http://mng.bz/orYv\"><img src=\"https://sebastianraschka.com/images/LLMs-from-scratch-images/cover-small.webp\" width=\"100px\"></a>\n",
    "</td>\n",
    "</tr>\n",
    "</table>\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b1910a06-e8a3-40ac-8201-ff70615b1ba4",
   "metadata": {
    "tags": []
   },
   "source": [
    "# 为指令数据集创建“被动语态”数据库"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a128651b-f326-4232-a994-42f38b7ed520",
   "metadata": {},
   "source": [
    "- 如下所示,我们用gpt-4来创建被动语态\n",
    "\n",
    "```python\n",
    "{  \n",
    "   'instruction': 'Identify the verb in the following sentence',\n",
    "   'input': 'The cat sleeps on the couch.',\n",
    "   'output': 'The verb in the sentence is \"sleeps.\"',\n",
    "   'output_2': 'The sentence is \"sleeps.\"'   #  <---- 新创建的条目\n",
    "}  \n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "267ba0d1-b884-42df-85bd-0be746fd47a5",
   "metadata": {},
   "outputs": [],
   "source": [
    "# pip install -r requirements-extra.txt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "63610acc-db94-437f-8d38-e99dca0299cb",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "openai version: 1.30.3\n",
      "tqdm version: 4.65.0\n"
     ]
    }
   ],
   "source": [
    "from importlib.metadata import version\n",
    "\n",
    "pkgs = [\"openai\",  # OpenAI API\n",
    "        \"tqdm\",    # 进度条\n",
    "       ]\n",
    "\n",
    "for p in pkgs:\n",
    "    print(f\"{p} version: {version(p)}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8bcdcb34-ac75-4f4f-9505-3ce0666c42d5",
   "metadata": {},
   "source": [
    "## Test OpenAI API"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9558a522-650d-401a-84fc-9fd7b1f39da7",
   "metadata": {},
   "source": [
    "- 首先，让我们测试一下OpenAI API是否已正确设置\n",
    "- 如果您还没有账户，需要在 https://platform.openai.com/ 创建一个\n",
    "- 请注意，您还需要向账户中转账，因为GPT-4 API不是免费的（请参见 https://platform.openai.com/settings/organization/billing/overview）\n",
    "- 使用本笔记本中的代码创建约200个被动语态条目大约需要花费0.13美元（13美分）"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "89343a84-0ddc-42fc-bf50-298a342b93c0",
   "metadata": {},
   "source": [
    "- 首先，我们需要提供OpenAI API的密钥，您可以在 https://platform.openai.com/api-keys 找到该密钥\n",
    "- 请确保不要与任何人共享此密钥\n",
    "- 将此密钥（`\"sk-...\"`）添加到此文件夹中的 `config.json` 文件"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "26900564-aba7-48ba-8ee8-6cc9a505a25c",
   "metadata": {},
   "outputs": [],
   "source": [
    "import json\n",
    "from openai import OpenAI\n",
    "\n",
    "# 从 JSON 文件中加载 API 密钥. \n",
    "# 请确保将 \"sk-...\" 替换为您在 https://platform.openai.com/api-keys 上的实际 API 密钥\n",
    "with open(\"config.json\", \"r\") as config_file:\n",
    "    config = json.load(config_file)\n",
    "    api_key = config[\"OPENAI_API_KEY\"]\n",
    "\n",
    "client = OpenAI(api_key=api_key)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "16642a48-1cab-40d2-af08-ab8c2fbf5876",
   "metadata": {},
   "source": [
    "- 首先，让我们通过一个简单的示例来测试API，确保它按预期工作："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "08e9ef2e-e816-4283-840e-43625791ad33",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'Breakfast was eaten by me.'"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "def run_chatgpt(prompt, client, model=\"gpt-4-turbo\"):\n",
    "    response = client.chat.completions.create(\n",
    "        model=model,\n",
    "        messages=[{\"role\": \"user\", \"content\": prompt}],\n",
    "        temperature=0.0,\n",
    "    )\n",
    "    return response.choices[0].message.content\n",
    "\n",
    "\n",
    "# 准备输入\n",
    "sentence = \"I ate breakfast\"\n",
    "prompt = f\"Convert the following sentence to passive voice: '{sentence}'\"\n",
    "run_chatgpt(prompt, client)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "162a4739-6f03-4092-a5c2-f57a0b6a4c4d",
   "metadata": {},
   "source": [
    "## 创建json"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ca011a8b-20c5-4101-979e-9b5fccf62f8a",
   "metadata": {},
   "source": [
    "- 导入想要优化的json"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "8b2d393a-aa92-4190-9d44-44326a6f699b",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Number of entries: 200\n"
     ]
    }
   ],
   "source": [
    "import json\n",
    "\n",
    "json_file = \"instruction-examples.json\"\n",
    "\n",
    "with open(json_file, \"r\") as file:\n",
    "    json_data = json.load(file)\n",
    "    \n",
    "print(\"Number of entries:\", len(json_data))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "39a55283-7d51-4136-ba60-f799d49f4098",
   "metadata": {},
   "source": [
    "- 小试牛刀,确保API可用"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "735cc089-d127-480a-b39d-0782581f0c41",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "Input:\n",
      ">> The verb in the sentence is \"sleeps.\"\n",
      "\n",
      "Output:\n",
      ">> The sentence is \"sleeps.\"\n",
      "\n",
      "-------------------------\n",
      "\n",
      "Input:\n",
      ">> The plural form of \"goose\" is \"geese.\"\n",
      "\n",
      "Output:\n",
      ">> The plural form of \"goose\" is referred to as \"geese.\"\n",
      "\n",
      "-------------------------\n",
      "\n",
      "Input:\n",
      ">> The three primary colors are red, blue, and yellow.\n",
      "\n",
      "Output:\n",
      ">> Red, blue, and yellow are considered the three primary colors.\n",
      "\n",
      "-------------------------\n",
      "\n",
      "Input:\n",
      ">> They had finished the game.\n",
      "\n",
      "Output:\n",
      ">> The game had been finished by them.\n",
      "\n",
      "-------------------------\n",
      "\n",
      "Input:\n",
      ">> The abbreviation for \"Doctor of Philosophy\" is Ph.D.\n",
      "\n",
      "Output:\n",
      ">> The abbreviation \"Ph.D.\" is used for \"Doctor of Philosophy\".\n",
      "\n",
      "-------------------------\n"
     ]
    }
   ],
   "source": [
    "for entry in json_data[:5]:\n",
    "    text = entry[\"output\"]\n",
    "    prompt = f\"Without adding any response or explanation, convert the following text to passive voice: {text}\"\n",
    "    \n",
    "    print(\"\\nInput:\")\n",
    "    print(\">>\", text)\n",
    "    print(\"\\nOutput:\")\n",
    "    print(\">>\", run_chatgpt(prompt, client))\n",
    "    print(\"\\n-------------------------\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "142dfaa7-429f-4eb0-b74d-ff327f79547a",
   "metadata": {},
   "source": [
    "- 拓展代码也拓展数据集"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "4f700d4b-19e5-4404-afa7-b0f093024232",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "100%|██████████████████████████████████████████████████████████████████████| 5/5 [00:04<00:00,  1.23it/s]\n"
     ]
    }
   ],
   "source": [
    "from tqdm import tqdm  # 一个进度条工具\n",
    "\n",
    "\n",
    "for i, entry in tqdm(enumerate(json_data[:5]), total=len(json_data[:5])):\n",
    "    text = entry[\"output\"]\n",
    "    prompt = f\"Without adding any response or explanation, convert the following text to passive voice: {text}\"\n",
    "    json_data[i][\"output_2\"] = run_chatgpt(prompt, client)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cd144282-0596-4e9b-9815-322cff34b400",
   "metadata": {},
   "source": [
    "- 确保输出合理有效"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "5b6eaa87-a86d-42a1-a20a-b764b0d559d4",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'instruction': 'Identify the verb in the following sentence: The cat sleeps on the couch.',\n",
       " 'input': '',\n",
       " 'output': 'The verb in the sentence is \"sleeps.\"',\n",
       " 'output_2': 'The sentence is \"sleeps.\"'}"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "json_data[0]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6970e8cf-2b18-4e3d-9f25-e6a4489c39a7",
   "metadata": {},
   "source": [
    "- 万事俱备,运行下程序"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "eef99407-8ffd-4a63-b7ab-ffe30c0f0677",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "100%|██████████████████████████████████████████████████████████████████| 200/200 [03:43<00:00,  1.12s/it]\n"
     ]
    }
   ],
   "source": [
    "for i, entry in tqdm(enumerate(json_data), total=len(json_data)):\n",
    "    text = entry[\"output\"]\n",
    "    prompt = f\"Without adding any response or explanation, convert the following text to passive voice: {text}\"\n",
    "    json_data[i][\"output_2\"] = run_chatgpt(prompt, client)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ac91ae85-2f0e-456a-be1d-56e1958f30d8",
   "metadata": {},
   "source": [
    "- 测试结束,保存模型"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "330cc30a-b08e-4bf0-bee2-bec0da4208de",
   "metadata": {},
   "outputs": [],
   "source": [
    "new_json_file = json_file.replace(\".json\", \"-modified.json\")\n",
    "\n",
    "\n",
    "with open(new_json_file, \"w\") as file:\n",
    "    json.dump(json_data, file, indent=4)  # \"indent\"设置为了更美观的展示"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
